Multi-Modal Conversational Search and Browse
Authors
Abstract
In this paper, we create an open-domain conversational system by combining the power of internet browser interfaces with multi-modal inputs and data mined from web search and browser logs. The work focuses on two novel components: (1) dynamic contextual adaptation of speech recognition and understanding models using visual context, and (2) fusion of users' speech and gesture inputs to understand their intents and associated arguments. The system was evaluated in a living room setup with live test subjects on a real-time implementation of the multi-modal dialog system. Users interacted with a television browser using gestures and speech; gestures were captured by Microsoft Kinect skeleton tracking, and speech was recorded by a Kinect microphone array. Results show a 16% error rate reduction (ERR) from contextual ASR adaptation to clickable web page content, and a 7-10% ERR when using gestures together with speech. Analysis of the results suggests a strategy for multi-modal intent selection when users clearly and persistently indicate pointing intent (e.g., eye gaze), giving a 54.7% ERR over lexical features.
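The fusion of speech and gesture inputs described above can be illustrated with a minimal late-fusion sketch: each modality independently scores a set of candidate intents, and the fused intent maximizes a weighted log-linear combination of the two score distributions. This is not the paper's implementation; the function name, the intent labels, and the modality weight are all illustrative assumptions.

```python
import math

def fuse_intents(speech_scores, gesture_scores, speech_weight=0.7):
    """Pick the intent maximizing a weighted log-linear fusion of two modalities.

    speech_scores, gesture_scores: dicts mapping intent name -> probability.
    Intents missing from one modality receive a small floor probability so
    the log stays finite. Weights and labels here are illustrative only.
    """
    floor = 1e-6
    intents = set(speech_scores) | set(gesture_scores)
    fused = {}
    for intent in intents:
        s = max(speech_scores.get(intent, floor), floor)
        g = max(gesture_scores.get(intent, floor), floor)
        # Log-linear combination: higher speech_weight trusts ASR more.
        fused[intent] = speech_weight * math.log(s) + (1 - speech_weight) * math.log(g)
    return max(fused, key=fused.get)

# Speech weakly favors "search", but a confident pointing gesture
# favors "click"; the gesture evidence flips the fused decision.
best = fuse_intents({"search": 0.6, "click": 0.4},
                    {"click": 0.9, "select": 0.1})
```

A persistent pointing signal (as in the eye-gaze strategy above) could be modeled by raising the gesture modality's weight when pointing confidence stays high across frames.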
Similar references
Capacitated Single Allocation P-Hub Covering Problem in Multi-modal Network Using Tabu Search
The goals of hub location problems are finding the location of hub facilities and determining the allocation of non-hub nodes to these located hubs. In this work, we discuss the multi-modal single allocation capacitated p-hub covering problem over fully interconnected hub networks, and we provide a formulation to this end. The purpose of our model is to find the location of hubs and the ...
Using hotspots as a novel method for accessing key events in a large multi-modal corpus
In 2009 we created the D64 corpus, a multi-modal corpus which consists of roughly eight hours of natural, non-directed spontaneous interaction in an informal setting. Five participants feature in the recordings and their conversations were captured by microphones (room, body mounted and head mounted), video cameras and a motion capture system. The large amount of video, audio and motion capture...
"Name That Song!" A Probabilistic Approach to Querying on Music and Text
We present a novel, flexible statistical approach for modelling music and text jointly. The approach is based on multi-modal mixture models and maximum a posteriori estimation using EM. The learned models can be used to browse databases with documents containing music and text, to search for music using queries consisting of music and text (lyrics and other contextual information), to annotate ...
How Do I Address You? Modelling addressing behaviour based on an analysis of multi-modal corpora of conversational discourse
Addressing is a special kind of referring and thus principles of multi-modal referring expression generation will also be basic for generation of address terms and addressing gestures for conversational agents. Addressing is a special kind of referring because of the different (second person instead of object) role that the referent has in the interaction. Based on an analysis of addressing beh...
Different Approaches to Build Multilingual Conversational Systems
The paper describes developments and results of the work being carried out during the European research project CATCH-2004 (Converse in AThens Cologne and Helsinki). The objective of the project is multi-modal, multi-lingual conversational access to information systems. This paper concentrates on issues of the multilingual telephony-based speech and natural language understanding components.